Abstract

In the general case, a trilinear relationship between three perspective views is shown to exist. The trilinearity 
result is shown to be of much practical use in visual recognition by alignment — yielding a direct method 
that cuts through the computations of camera transformation, scene structure and epipolar geometry. 
The proof of the central result may be of further interest as it demonstrates certain regularities across 
homographies of the plane and introduces new view invariants. Experiments on simulated and real image 
data were conducted, including a comparative analysis with epipolar intersection and the linear combination 
methods, with results indicating a greater degree of robustness in practice and a higher level of performance 
in re-projection tasks. 
1 Introduction

We establish a general result about algebraic connections 
across three perspective views of a 3D scene and demon- 
strate its application to visual recognition via alignment. 
We show that, in general, any three perspective views of 
a scene satisfy a pair of trilinear functions of image co- 
ordinates. In the limiting case, when all three views are 
orthographic, these functions become linear and reduce 
to the form discovered by [34]. Using the trilinear result
one can manipulate views of an object (for example, generate
novel views from two model views) without recovering
scene structure (metric or non-metric), camera transformation,
or even the epipolar geometry.

The central results in this paper are contained in Theorems
1 and 2. The first theorem states that the variety of
views $\psi$ of a fixed 3D object obtained by an uncalibrated
pin-hole camera satisfy a relation of the sort
$F(\psi, \psi_1, \psi_2) = 0$, where $\psi_1, \psi_2$ are two arbitrary views
of the object, and $F$ has a special trilinear form. The
coefficients of $F$ can be recovered linearly without establishing
first the epipolar geometry, 3D structure of
the object, or camera motion. The auxiliary Lemmas
required for the proof of Theorem 1 may be of interest
on their own as they establish certain regularities across
projective transformations of the plane and introduce
new view invariants (Lemma 4).

Theorem 2 is an obvious corollary of Theorem 1 but
contains a significant practical aspect. It is shown that
if the views $\psi_1, \psi_2$ are obtained by parallel projection,
then $F$ reduces to a special bilinear form or, equivalently,
that any perspective view $\psi$ can be obtained by a
rational linear function of two orthographic views. The
reduction to a bilinear form implies that simpler recognition
schemes are possible if the two reference views
(model views) stored in memory are orthographic.

These two results may have several applications (dis- 
cussed in Section 6), but the one emphasized throughout 
this paper is for the task of recognition of 3D objects 
via alignment. The alignment approach for recognition 
([33, 16], and references therein) is based on the result 
that the equivalence class of views of an object (ignor- 
ing self occlusions) undergoing 3D rigid, affine or pro- 
jective transformations can be captured by storing a 3D 
model of the object, or simply by storing at least two 
arbitrary "model" views of the object — assuming that 
the correspondence problem between the model views 
can somehow be solved (cf. [25, 5, 29]). During recog- 
nition a small number of corresponding points between 
the novel input view and the model views of a particular 
candidate object are sufficient to "re-project" the model 
onto the novel viewing position. Recognition is achieved 
if the re-projected image is successfully matched against 
the input image. We refer to the problem of predicting 
a novel view from a set of model views using a limited 
number of corresponding points, as the problem of re- 
projection. 

The problem of re-projection can in principle be dealt
with via 3D reconstruction of shape and camera motion.
This includes classical structure from motion methods
for recovering rigid camera motion parameters and metric
shape [32, 18, 31, 14, 15], and more recent methods
for recovering non-metric structure, i.e., assuming
the objects undergo 3D affine or projective transforma- 
tions, or equivalently, that the cameras are uncalibrated 
[17, 23, 35, 10, 13, 27, 28]. The classic approaches for 
perspective views are known to be unstable under errors 
in image measurements, narrow field of view, and inter- 
nal camera calibration [3, 9, 12], and therefore, are un- 
likely to be of practical use for purposes of re-projection. 
The non-metric approaches, as a general concept, have 
not been fully tested on real images, but the methods 
proposed so far rely on recovering first the epipolar ge- 
ometry — a process that is also known to be unstable in 
the presence of noise. 

It is also known that the epipolar geometry is by itself 
sufficient to achieve re-projection by means of intersect- 
ing epipolar lines [22, 6, 8, 24, 21, 11]. This, however, 
is possible only if the centers of the three cameras are 
non-collinear — which can lead to numerical instability 
unless the centers are far from collinear — and any ob- 
ject point on the tri-focal plane cannot be re-projected 
as well. Furthermore, as with the non-metric reconstruc- 
tion methods, obtaining the epipolar geometry is at best 
a sensitive process even when dozens of corresponding 
points are used and with the state of the art methods 
(see Section 5 for more details and comparative analysis 
with simulated and real images). 

For purposes of stability, therefore, it is worthwhile
exploring more direct tools for achieving re-projection.
For instance, instead of reconstruction of shape and invariants
we would like to establish a direct connection
between views expressed as functions of image coordinates
alone, which we will call "algebraic functions
of views". Such a result was established in the orthographic
case by [34]. There it was shown that any three
orthographic views of an object satisfy a linear function
of the corresponding image coordinates; this, we will
show here, is simply a limiting case of a larger set of
algebraic functions that in general have a trilinear form.
With these functions one can manipulate views of an
object, such as create new views, without the need to
recover shape or camera geometry as an intermediate
step: all that is needed is to appropriately combine
the image coordinates of two reference views. Also, with
these functions, the epipolar geometries are intertwined,
leading not only to the absence of singularities but also,
as we shall see in the experimental section, to more accurate
performance in the presence of errors in image
measurements.

2 Notations 

We consider object space to be the three-dimensional
projective space $\mathcal{P}^3$, and image space to be the
two-dimensional projective space $\mathcal{P}^2$. Let $\Phi \subset \mathcal{P}^3$ be a set of
points standing for a 3D object, and let $\psi_i \subset \mathcal{P}^2$ denote
views (arbitrary), indexed by $i$, of $\Phi$. Given two cameras
with centers located at $O, O' \in \mathcal{P}^3$, respectively, the
epipoles are defined to be at the intersection of the line
$OO'$ with both image planes. Because the image plane is
finite, we can assign, without loss of generality, the value
1 as the third homogeneous coordinate to every observed
image point. That is, if $(x, y)$ are the observed image
coordinates of some point (with respect to some arbitrary
origin, say the geometric center of the image), then
$p = (x, y, 1)$ denotes the homogeneous coordinates of
the image plane. Since we will be working with at most
three views at a time, we denote the relevant epipoles
as follows: let $v \in \psi_1$ and $v' \in \psi_2$ be the corresponding
epipoles between views $\psi_1, \psi_2$, and let $\bar{v} \in \psi_1$ and
$v'' \in \psi_3$ be the corresponding epipoles between views $\psi_1, \psi_3$.
Likewise, corresponding image points across three views
will be denoted by $p = (x, y, 1)$, $p' = (x', y', 1)$ and
$p'' = (x'', y'', 1)$. The term "image coordinates" will denote
the non-homogeneous coordinate representation of
$\mathcal{P}^2$, e.g., $(x, y)$, $(x', y')$, $(x'', y'')$ for the three
corresponding points.

Planes will be denoted by $\pi_i$, indexed by $i$, and just $\pi$
if only one plane is discussed. All planes are assumed to
be arbitrary and distinct from one another. The symbol
$\cong$ denotes equality up to a scale, $GL_n$ stands for the
group of $n \times n$ matrices, and $PGL_n$ is the group defined
up to a scale.

A coordinate representation $\mathcal{R}$ of $\mathcal{P}^3$ is a tetrad of
coordinates $[z_0, z_1, z_2, z_3]$ such that if $\mathcal{R}_0$ is any one
allowable representation, the whole class $\mathcal{R}$ consists of all
those representations that can be obtained from $\mathcal{R}_0$ by
the action of the group $PGL_4$. Given a set of views $\psi_i$,
$i = 1, 2, \ldots$, of $\Phi$, where coordinates on $\psi_1$ are $[x, y, 1]$ and
$\mathcal{R}_0$ is a representation for which $(z_0, z_1, z_2) = (x, y, 1)$,
we will say that the object is undergoing at most 3D
relative affine transformations between views if the class
of representations $\mathcal{R}$ consists of all those representations
that can be obtained from $\mathcal{R}_0$ by the action of an affine
subgroup of $PGL_4$. In other words, the object undergoes
some projective transformation and is projected onto the
view $\psi_1$, after which all other transformations applied to
$\Phi$ are affine. Note that this definition is general and
allows full uncalibrated pin-hole camera motion (for more
details on uncalibrated camera motion versus relative
affine transformation versus taking pictures of pictures
of the scene, see Appendix of [26]).

3 The Trilinear Form 

The central result of this paper is presented in the following
theorem. The remainder of the section is devoted
to the proof of this result and its implications.

Theorem 1 (Trilinearity) Let $\psi_1, \psi_2, \psi_3$ be three arbitrary
perspective views of some object, modeled by a set
of points in 3D, undergoing at most 3D relative affine
transformations between views. The image coordinates
$(x, y) \in \psi_1$, $(x', y') \in \psi_2$ and $(x'', y'') \in \psi_3$ of three
corresponding points across three views satisfy a pair of
trilinear equations of the following form:

$$x''(\alpha_1 x + \alpha_2 y + \alpha_3) + x''x'(\alpha_4 x + \alpha_5 y + \alpha_6) + x'(\alpha_7 x + \alpha_8 y + \alpha_9) + \alpha_{10} x + \alpha_{11} y + \alpha_{12} = 0,$$

and

$$y''(\beta_1 x + \beta_2 y + \beta_3) + y''x'(\beta_4 x + \beta_5 y + \beta_6) + x'(\beta_7 x + \beta_8 y + \beta_9) + \beta_{10} x + \beta_{11} y + \beta_{12} = 0,$$

where the coefficients $\alpha_j, \beta_j$, $j = 1, \ldots, 12$, are fixed for
all points, are uniquely defined up to an overall scale,
and $\alpha_j = \beta_j$, $j = 1, \ldots, 6$.

The following auxiliary propositions are used as part of
the proof.

Lemma 1 (Auxiliary - Existence) Let $A \in PGL_3$
be the projective mapping (homography) $\psi_1 \mapsto \psi_2$ due to
some plane $\pi$. Let $A$ be scaled to satisfy $p'_o \cong Ap_o + v'$,
where $p_o \in \psi_1$ and $p'_o \in \psi_2$ are corresponding points
coming from an arbitrary point $P_o \notin \pi$. Then, for any
corresponding pair $p \in \psi_1$ and $p' \in \psi_2$ coming from an
arbitrary point $P \in \mathcal{P}^3$, we have

$$p' \cong Ap + kv'.$$

The coefficient $k$ is independent of $\psi_2$, i.e., is invariant
to the choice of the second view.

The lemma, its proof and its theoretical and practical
implications are discussed in detail in [26]. Note that
the particular case where the homography $A$ is affine,
and the epipole $v'$ is on the line at infinity, corresponds
to the construction of affine structure from two orthographic
views [17]. The scalar $k$ is called a relative affine
invariant and represents the ratio of the distance of $P$
from $\pi$ along the line of sight, and the distance of $P$
from the camera center of $\psi_1$, normalized by the ratio
of distances of $P_o$ from the plane and the camera center.
This normalized ratio can be computed with the aid of
a second arbitrary view $\psi_2$.

Definition 1 Homographies $A_i \in PGL_3$ from $\psi_1 \mapsto \psi_i$
due to the same plane $\pi$, are said to be scale-compatible
if they are scaled to satisfy Lemma 1, i.e., for any point
$P \in \Phi$ projecting onto $p \in \psi_1$ and $p^i \in \psi_i$, there exists a
scalar $k$ that satisfies

$$p^i \cong A_i p + k v^i,$$

for any view $\psi_i$, where $v^i \in \psi_i$ is the epipole with $\psi_1$
(scaled arbitrarily).

Lemma 2 (Auxiliary - Uniqueness) Let $A, A' \in PGL_3$
be two homographies of $\psi_1 \mapsto \psi_2$ due to planes
$\pi_1, \pi_2$, respectively. Then, there exists a scalar $s$, that
satisfies the equation:

$$A - sA' = [\alpha v', \beta v', \gamma v'],$$

for some coefficients $\alpha, \beta, \gamma$.

Proof. Let $q \in \psi_1$ be any point in the first view.
There exists a scalar $s_q$ that satisfies $v' \cong Aq - s_q A'q$.
Let $H = A - s_q A'$, and we have $Hq \cong v'$. But, as shown
in [27], $Av \cong v'$ for any homography $\psi_1 \mapsto \psi_2$ due to any
plane. Therefore, $Hv \cong v'$ as well. The mapping of two
distinct points $q, v$ onto the same point $v'$ could happen
only if $Hp \cong v'$ for all $p \in \psi_1$, and $s_q$ is a fixed scalar $s$.
This, in turn, implies that $H$ is a matrix whose columns
are multiples of $v'$. □

Lemma 3 (Auxiliary for Lemma 4) Let $A, A' \in PGL_3$
be homographies from $\psi_1 \mapsto \psi_2$ due to distinct
planes $\pi_1, \pi_2$, respectively, and $B, B' \in PGL_3$ be
homographies from $\psi_1 \mapsto \psi_3$ due to $\pi_1, \pi_2$, respectively. Then,
$A' \cong AT$ for some $T \in PGL_3$, and $B' \cong BCTC^{-1}$,
where $Cv \cong \bar{v}$.

Proof. Let $A = A_2^{-1} A_1$, where $A_1, A_2$ are homographies
from $\psi_1, \psi_2$ onto $\pi_1$, respectively. Similarly
$B = B_2^{-1} B_1$, where $B_1, B_2$ are homographies from $\psi_1, \psi_3$
onto $\pi_1$, respectively. Let $A_1 v = (c_1, c_2, c_3)^T$, and let
$C = A_1^{-1}\,\mathrm{diag}(c_1, c_2, c_3)\,A_1$. Then, $B_1 = A_1 C^{-1}$, and
thus, we have $B = B_2^{-1} A_1 C^{-1}$. Note that the only
difference between $A_1$ and $B_1$ is due to the different
location of the epipoles $v, \bar{v}$, which is compensated by $C$
($Cv \cong \bar{v}$). Let $E_1 \in PGL_3$ be the homography from $\psi_1$
to $\pi_2$, and $E_2 \in PGL_3$ the homography from $\pi_2$ to $\pi_1$.
Then with proper scaling of $E_1$ and $E_2$ we have

$$A' = A_2^{-1} E_2 E_1 = A A_1^{-1} E_2 E_1 = AT,$$

and with proper scaling of $C$ we have,

$$B' = B_2^{-1} E_2 E_1 C^{-1} = B C A_1^{-1} E_2 E_1 C^{-1} = BCTC^{-1}.$$

□

Lemma 4 (Auxiliary - Uniqueness)
For scale-compatible homographies, the scalars $s, \alpha, \beta, \gamma$
of Lemma 2 are invariants indexed by $\psi_1, \pi_1, \pi_2$. That
is, given an arbitrary third view $\psi_3$, let $B, B'$ be the
homographies from $\psi_1 \mapsto \psi_3$ due to $\pi_1, \pi_2$, respectively. Let
$B$ be scale-compatible with $A$, and $B'$ be scale-compatible
with $A'$. Then,

$$B - sB' = [\alpha v'', \beta v'', \gamma v''].$$

Proof. We show first that $s$ is invariant, i.e., that $B - sB'$
is a matrix whose columns are multiples of $v''$. From
Lemma 2 and Lemma 3 there exists a matrix $H$, whose
columns are multiples of $v'$, a matrix $T$ that satisfies
$A' = AT$, and a scalar $s$ such that $I - sT = A^{-1}H$. After
multiplying both sides on the left by $BC$ and on the right
by $C^{-1}$ we obtain

$$B - sBCTC^{-1} = BCA^{-1}HC^{-1}.$$

From Lemma 3, we have $B' = BCTC^{-1}$. The matrix
$A^{-1}H$ has columns which are multiples of $v$ (because
$A^{-1}v' \cong v$), $CA^{-1}H$ is a matrix whose columns
are multiples of $\bar{v}$, and $BCA^{-1}H$ is a matrix whose
columns are multiples of $v''$. Multiplying $BCA^{-1}H$ on the
right by $C^{-1}$ does not change its form because every column
of $BCA^{-1}HC^{-1}$ is simply a linear combination of the
columns of $BCA^{-1}H$. As a result, $B - sB'$ is a matrix
whose columns are multiples of $v''$.

Let $H = A - sA'$ and $\hat{H} = B - sB'$. Since the homographies
are scale-compatible, we have from Lemma 1 the
existence of invariants $k, k'$ associated with an arbitrary
$p \in \psi_1$, where $k$ is due to $\pi_1$, and $k'$ is due to $\pi_2$:
$p' \cong Ap + kv' \cong A'p + k'v'$ and $p'' \cong Bp + kv'' \cong B'p + k'v''$.
Then from Lemma 2 we have $Hp = (sk' - k)v'$ and
$\hat{H}p = (sk' - k)v''$. Since $p$ is arbitrary, this could happen
only if the coefficients of the multiples of $v'$ in $H$
and the coefficients of the multiples of $v''$ in $\hat{H}$ coincide.

□

Proof of Theorem: Lemma 1 provides the existence
part of the theorem, as follows. Since Lemma 1 holds for
any plane, choose a plane $\pi_1$ and let $A, B$ be the
scale-compatible homographies $\psi_1 \mapsto \psi_2$ and $\psi_1 \mapsto \psi_3$,
respectively. Then, for every point $p \in \psi_1$, with corresponding
points $p' \in \psi_2$, $p'' \in \psi_3$, there exists a scalar $k$ that
satisfies: $p' \cong Ap + kv'$, and $p'' \cong Bp + kv''$. We can isolate
$k$ from both equations and obtain:

$$\frac{1}{k} = \frac{v'_1 - x' v'_3}{(x' a_3 - a_1)^T p} = \frac{v'_2 - y' v'_3}{(y' a_3 - a_2)^T p} = \frac{y' v'_1 - x' v'_2}{(x' a_2 - y' a_1)^T p}, \qquad (1)$$

$$\frac{1}{k} = \frac{v''_1 - x'' v''_3}{(x'' b_3 - b_1)^T p} = \frac{v''_2 - y'' v''_3}{(y'' b_3 - b_2)^T p} = \frac{y'' v''_1 - x'' v''_2}{(x'' b_2 - y'' b_1)^T p}, \qquad (2)$$

where $a_1, a_2, a_3$ and $b_1, b_2, b_3$ are the row vectors of $A$
and $B$, respectively, and $v' = (v'_1, v'_2, v'_3)$, $v'' = (v''_1, v''_2, v''_3)$. Because
of the invariance of $k$ we can equate terms of Equation 1
with terms of Equation 2 and obtain trilinear functions
of image coordinates across three views. For example,
by equating the first terms in each of the equations,
we obtain:

$$x''(v'_1 b_3 - v''_3 a_1)^T p + x''x'(v''_3 a_3 - v'_3 b_3)^T p + x'(v'_3 b_1 - v''_1 a_3)^T p + (v''_1 a_1 - v'_1 b_1)^T p = 0. \qquad (3)$$

In a similar fashion, after equating the first term of Equation 1
with the second term of Equation 2, we obtain:

$$y''(v'_1 b_3 - v''_3 a_1)^T p + y''x'(v''_3 a_3 - v'_3 b_3)^T p + x'(v'_3 b_2 - v''_2 a_3)^T p + (v''_2 a_1 - v'_1 b_2)^T p = 0. \qquad (4)$$

Both equations are of the desired form, with the first six
coefficients identical across both equations.

The question of uniqueness arises because Lemma 1
holds for any plane. If we choose a different plane, say
$\pi_2$, with homographies $A', B'$, then we must show that
the new homographies give rise to the same coefficients
(up to an overall scale). The parenthesized terms in
Equations 3 and 4 have the general form $v'_j b_i \pm v''_i a_j$,
for some $i$ and $j$. Thus, we need to show that there exists
a scalar $s$ that satisfies

$$v''_i (a_j - s a'_j) = v'_j (b_i - s b'_i).$$

This, however, follows directly from Lemmas 2 and 4. □
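
The step of isolating $k$ in the proof above can be written out explicitly. The following is only a sketch of the algebra for the first ratio of Equation 1, assuming $p' = (x', y', 1)$ and writing $a_i$ for the rows of $A$ as in the proof; the other ratios follow in the same way from $y'$ and from the quotient $x'/y'$.

```latex
% Eliminate the unknown scale in  p' \cong A p + k v'  using the third coordinate:
x' = \frac{a_1^T p + k\, v'_1}{a_3^T p + k\, v'_3}
\;\Longrightarrow\;
x'\,(a_3^T p + k\, v'_3) = a_1^T p + k\, v'_1
\;\Longrightarrow\;
k\,(v'_1 - x' v'_3) = (x' a_3 - a_1)^T p .
% The same steps applied to  p'' \cong B p + k v''  give
k\,(v''_1 - x'' v''_3) = (x'' b_3 - b_1)^T p ,
% and cross-multiplying the two expressions removes k:
(v'_1 - x' v'_3)\,(x'' b_3 - b_1)^T p = (v''_1 - x'' v''_3)\,(x' a_3 - a_1)^T p .
```

Expanding the last line and collecting the terms in $x''$, $x''x'$ and $x'$ gives Equation 3.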

The direct implication of the theorem is that one can
generate a novel view ($\psi_3$) by simply combining two
model views ($\psi_1, \psi_2$). The coefficients $\alpha_j$ and $\beta_j$ of the
combination can be recovered together as a solution of
a linear system of 17 equations ($24 - 6 - 1$) given nine
corresponding points across the three views (more than
nine points can be used for a least-squares solution).
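
As a minimal sketch of how Equations 3 and 4 can be used for re-projection, the Python fragment below builds the two coefficient sets from scale-compatible homographies and epipoles, then solves each equation for one coordinate of the third view. The function names (`trilinear_coefficients`, `reproject`) and the list-of-rows matrix layout are illustrative assumptions, not part of the paper; in practice the coefficients would be estimated from correspondences rather than from known $A$, $B$, $v'$, $v''$.

```python
def _combine(s, u, t, w):
    """Return the 3-vector s*u - t*w."""
    return [s * ui - t * wi for ui, wi in zip(u, w)]


def trilinear_coefficients(A, B, v1, v2):
    """Coefficients of Equations 3 and 4 (hypothetical helper).

    A, B   : 3x3 scale-compatible homographies as lists of rows
             (psi_1 -> psi_2 and psi_1 -> psi_3)
    v1, v2 : epipoles v' and v'' as 3-vectors
    Returns (alpha, beta), 12 numbers each; alpha[:6] == beta[:6].
    """
    a1, _a2, a3 = A
    b1, b2, b3 = B
    # Terms multiplying x'' (or y'') and x''x' (or y''x'): shared by both equations.
    shared = _combine(v1[0], b3, v2[2], a1) + _combine(v2[2], a3, v1[2], b3)
    alpha = shared + _combine(v1[2], b1, v2[0], a3) + _combine(v2[0], a1, v1[0], b1)
    beta = shared + _combine(v1[2], b2, v2[1], a3) + _combine(v2[1], a1, v1[0], b2)
    return alpha, beta


def reproject(coeffs, x, y, x_prime):
    """Solve  c*(g0 + x'*g1) + x'*g2 + g3 = 0  for c (x'' or y'')."""
    p = (x, y, 1.0)
    # g[i] is the i-th linear polynomial in (x, y) evaluated at p.
    g = [sum(c * pi for c, pi in zip(coeffs[3 * i:3 * i + 3], p)) for i in range(4)]
    denom = g[0] + x_prime * g[1]
    if abs(denom) < 1e-12:
        raise ZeroDivisionError("singular configuration for this pair of equations")
    return -(x_prime * g[2] + g[3]) / denom
```

Given a point $(x, y)$ in the first view and only the $x'$ coordinate of its match in the second view, `reproject(alpha, x, y, x_prime)` returns $x''$ and `reproject(beta, x, y, x_prime)` returns $y''$.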

Taken together, Equations 1 and 2 lead to 9 algebraic
functions of three views, six of which are separate for $x''$
and $y''$. Besides Equations 3 and 4, the other four of these
six functions are listed below:

$$x''(\cdot) + x''y'(\cdot) + y'(\cdot) + (\cdot) = 0,$$

$$y''(\cdot) + y''y'(\cdot) + y'(\cdot) + (\cdot) = 0,$$

$$x''x'(\cdot) + x''y'(\cdot) + x'(\cdot) + y'(\cdot) = 0,$$

$$y''x'(\cdot) + y''y'(\cdot) + x'(\cdot) + y'(\cdot) = 0,$$

where $(\cdot)$ represents linear polynomials in $x, y$. The solution
for $x'', y''$ is unique without constraints on the allowed
camera transformations. If we choose Equations 3
and 4, then $v'_1$ and $v'_3$ should not vanish simultaneously,
i.e., $v' \cong (0, 1, 0)$ is a singular case. Also $v'' \cong (0, 1, 0)$
and $v'' \cong (1, 0, 0)$ give rise to singular cases. One can easily
show that for each singular case there are two other
functions out of the nine available ones that provide a
unique solution for $x'', y''$. Note that the singular cases
are pointwise, i.e., only three epipolar directions are excluded,
compared to the more widespread singular cases
that occur with epipolar intersection, as described in the
introduction.

In practical terms, the process of generating a novel 
view can be easily accomplished without the need to ex- 
plicitly recover structure, camera transformation, or just 
the epipolar geometry. The process described here is 
fundamentally different from intersecting epipolar lines 
in the following ways: first, we use the three views to- 
gether, instead of pairs of views separately; second, there 
is no process of line intersection, i.e., the x and y coordinates
of $\psi_3$ are obtained separately as a solution of a
single equation in coordinates of the other two views;
and thirdly, the process is well defined in cases where 
intersecting epipolar lines becomes singular (e.g., when 
the three camera centers are collinear). Furthermore, by 
avoiding the need to recover the epipolar geometry we 
obtain a significant practical advantage, since the epipo- 
lar geometry is the most error-sensitive component when 
working with perspective views. 

The connection between the general result of trilinear
functions of views and the "linear combination of views"
result [34] for orthographic views can easily be seen by
setting $A$ and $B$ to be affine in $\mathcal{P}^2$, and $v'_3 = v''_3 = 0$.
For example, Equation 3 reduces to

$$v'_1 x'' - v''_1 x' + (v''_1 a_1 - v'_1 b_1)^T p = 0,$$

which is of the form

$$\alpha_1 x'' + \alpha_2 x' + \alpha_3 x + \alpha_4 y + \alpha_5 = 0.$$

Thus, in the case where all three views are orthographic,
$x''$ is expressed as a linear combination of image coordinates
of the two other views, as discovered by [34].

4 The Bilinear Form 

Consider the case for which the two reference (model) 
views of an object are taken orthographically (using a 
tele lens would provide a reasonable approximation), but 
during recognition any perspective view of the object is 
allowed. It can easily be shown that the three views are 
then connected via bilinear functions (instead of trilin- 
ear): 

Theorem 2 (Bilinearity) Within the conditions of
Theorem 1, in case the views $\psi_1$ and $\psi_2$ are obtained
by parallel projection, then the pair of trilinear forms of
Theorem 1 reduce to the following pair of bilinear equations:

$$x''(\alpha_1 x + \alpha_2 y + \alpha_3) + \alpha_4 x''x' + \alpha_5 x' + \alpha_6 x + \alpha_7 y + \alpha_8 = 0,$$

$$y''(\beta_1 x + \beta_2 y + \beta_3) + \beta_4 y''x' + \beta_5 x' + \beta_6 x + \beta_7 y + \beta_8 = 0,$$

where $\alpha_j = \beta_j$, $j = 1, \ldots, 4$.

Proof. Under these conditions we have from Lemma 1
that $A$ is affine in $\mathcal{P}^2$ and $v'_3 = 0$, therefore Equation 3
reduces to:

$$x''(v'_1 b_3 - v''_3 a_1)^T p + v''_3 x''x' - v''_1 x' + (v''_1 a_1 - v'_1 b_1)^T p = 0.$$

Similarly, Equation 4 reduces to:

$$y''(v'_1 b_3 - v''_3 a_1)^T p + v''_3 y''x' - v''_2 x' + (v''_2 a_1 - v'_1 b_2)^T p = 0.$$

Both equations are of the desired form, with the first
four coefficients identical across both equations. □

A bilinear function of three views has two advantages 
over the general trilinear function. First, only six cor- 
responding points (instead of nine) across three views 
are required for solving for the coefficients. Second, the 
lower the degree of the algebraic function, the less sen- 
sitive the solution should be in the presence of errors in 
measuring correspondences. In other words, it is likely 
(though not necessary) that the higher order terms, such 
as the term x"x'x in Equation 3, will have a higher con- 
tribution to the overall error sensitivity of the system. 
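
The point counts quoted for the two forms follow from a simple count of unknowns, sketched here for convenience. The only assumption is that each correspondence contributes two equations, one for $x''$ and one for $y''$.

```latex
\begin{aligned}
\text{trilinear:}\quad & 12 + 12 - 6 - 1 = 17 \ \text{unknowns}, \qquad 2 \cdot 9 = 18 \ge 17, \\
\text{bilinear:}\quad  & 8 + 8 - 4 - 1 = 11 \ \text{unknowns}, \qquad\ \ 2 \cdot 6 = 12 \ge 11 .
\end{aligned}
```

In each case the subtracted terms are the coefficients shared by the two equations and the overall scale. Five points would give only 10 equations in the bilinear case, which is why six is the minimum.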

Compared to the case when all views are assumed or- 
thographic, this case is much less of an approximation. 
Since the model views are taken only once, it is not un- 
reasonable to require that they be taken in a special 
way, namely, with a tele lens (assuming we are dealing 
with object recognition, rather than scene recognition). 
If that requirement is satisfied, then the recognition task 
is general since we allow any perspective view to be taken 
during the recognition process. 

5 Experimental Data 

The experiments described in this section were done in 
order to evaluate the practical aspect of using the trilin- 
ear result for re-projection compared to using epipolar 
intersection and the linear combination result of [34] (the 
latter we have shown is simply a limiting case of the tri- 
linear result). 

The epipolar intersection method was implemented in
the following way. Let $F_{13}$ and $F_{23}$ be the matrices
("essential" matrices in classical terminology [18], which we
adopt here) that satisfy $p''^T F_{13}\, p = 0$ and $p''^T F_{23}\, p' = 0$.
Then, by incidence of $p''$ with its epipolar lines, we have:

$$p'' \cong F_{13}\, p \times F_{23}\, p'.$$

Therefore, given eight corresponding points across the
three views, we can recover the two essential matrices,
and then re-project all other object points onto the third
view. In practice one would use more than eight points
for recovering the essential matrices in a linear or
non-linear least-squares method. Since linear least-squares
methods are still sensitive to image noise, we used the
implementation of a non-linear method described in [19] which
was kindly provided by T. Luong and L. Quan.
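
The intersection step itself is a cross product of two homogeneous line vectors. The sketch below illustrates only that step; it assumes $F_{13}$ and $F_{23}$ have already been estimated and are given as 3x3 lists of rows, and the function names are hypothetical. It is not the non-linear estimation procedure of [19].

```python
def cross(a, b):
    """Cross product of two 3-vectors (points or lines in homogeneous coordinates)."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def mat_vec(M, v):
    """Product of a 3x3 matrix (list of rows) with a 3-vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def epipolar_intersection(F13, F23, p, p_prime, eps=1e-12):
    """Re-project onto the third view as the intersection of two epipolar lines.

    Returns (x'', y''), or None when the two lines coincide or meet at
    infinity, which is the singular situation discussed in the text.
    """
    line_from_view1 = mat_vec(F13, p)        # epipolar line of p in psi_3
    line_from_view2 = mat_vec(F23, p_prime)  # epipolar line of p' in psi_3
    q = cross(line_from_view1, line_from_view2)
    if abs(q[2]) < eps:
        return None
    return (q[0] / q[2], q[1] / q[2])
```

When the two epipolar lines are nearly parallel the third component of the cross product is close to zero, and the division amplifies any error in the lines; this is the sensitivity that the trilinear functions avoid.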

The first experiment is with simulation data showing 
that even when the epipolar geometry is recovered accu- 
rately, it is still significantly better to use the trilinear 
result which avoids the process of line intersection. The 
second experiment is done on a real set of images, com- 
paring the performance of the various methods and the 
number of corresponding points that are needed in prac- 
tice to achieve reasonable re-projection results. 

5.1 Computer Simulations 

Figure 1: Comparing the performance of the epipolar intersection method (dotted line) and the trilinear functions
method (dashed line) in the presence of image noise. The graph on the left shows the maximal re-projection error
averaged over 200 trials per noise level (bars represent standard deviation). The graph on the right displays the average
re-projection error, averaged over all re-projected points and over the 200 trials per noise level.

We used an object of 46 points placed randomly with z
coordinates between 100 units and 120 units, and x, y
coordinates ranging randomly between -125 and +125.
Focal length was 50 units and the first view was obtained
by $(fx/z, fy/z)$. The second view ($\psi_2$) was generated
by a rotation around the point (0, 0, 100) with axis
(0.14, 0.7, 0.7) and by an angle of 0.3 radians. The third
view ($\psi_3$) was generated by a rotation around an axis
(0, 1, 0) with the same translation and angle. Various
amounts of random noise were applied to all points that
were to be re-projected onto a third view, but not to the
eight or nine points that were used for recovering the
parameters (essential matrices, or trilinear coefficients).
The noise was random, added separately to each coordinate
and with varying levels from 0.5 to 2.5 pixel error.
We have done 1000 trials as follows: 20 random
objects were created, and for each degree of error the
simulation was run 10 times per object. We collected
the maximal re-projection error (in pixels) and the average
re-projection error (averaged over all the points that
were re-projected). These numbers were collected separately
for each degree of error by averaging over all trials
(200 of them) and recording the standard deviation as
well. Since no errors were added to the eight or nine
points that were used to determine the epipolar geometry
and the trilinear coefficients, we simply solved the
associated linear systems of equations required to obtain
the essential matrices or the trilinear coefficients.
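
The data generation described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the original simulation code: the noise is taken to be uniform (the text only says "random"), the third view is assumed to rotate about the same point (0, 0, 100), and all function names are hypothetical.

```python
import math
import random


def rotate_about_point(P, axis, angle, pivot):
    """Rotate 3D point P by `angle` radians about `axis` through `pivot` (Rodrigues formula)."""
    n = math.sqrt(sum(a * a for a in axis))
    k = [a / n for a in axis]                      # unit rotation axis
    v = [P[i] - pivot[i] for i in range(3)]        # point relative to the pivot
    kxv = [k[1] * v[2] - k[2] * v[1],
           k[2] * v[0] - k[0] * v[2],
           k[0] * v[1] - k[1] * v[0]]
    kdv = sum(k[i] * v[i] for i in range(3))
    c, s = math.cos(angle), math.sin(angle)
    r = [v[i] * c + kxv[i] * s + k[i] * kdv * (1.0 - c) for i in range(3)]
    return [r[i] + pivot[i] for i in range(3)]


def project(P, f=50.0):
    """Pin-hole projection (f*x/z, f*y/z) with focal length f."""
    return (f * P[0] / P[2], f * P[1] / P[2])


def make_views(n_points=46, seed=0):
    """Random object and its three views, following the parameters quoted in the text."""
    rng = random.Random(seed)
    pts = [[rng.uniform(-125, 125), rng.uniform(-125, 125), rng.uniform(100, 120)]
           for _ in range(n_points)]
    pivot = (0.0, 0.0, 100.0)
    view1 = [project(P) for P in pts]
    view2 = [project(rotate_about_point(P, (0.14, 0.7, 0.7), 0.3, pivot)) for P in pts]
    view3 = [project(rotate_about_point(P, (0.0, 1.0, 0.0), 0.3, pivot)) for P in pts]
    return view1, view2, view3


def add_noise(view, level, rng):
    """Perturb each coordinate independently (uniform noise is an assumption)."""
    return [(x + rng.uniform(-level, level), y + rng.uniform(-level, level))
            for x, y in view]
```

A trial would then apply `add_noise` to the points to be re-projected (but not to the eight or nine points used for parameter recovery), run both re-projection methods, and record the maximal and average error.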

The results are shown in Figure 1. The graph on 
the left shows the performance of both algorithms for 
each level of image noise by measuring the maximal re- 
projection error. We see that under all noise levels, the 
trilinear method is significantly better and also has a 
smaller standard deviation. Similarly for the average re- 
projection error shown in the graph on the right. 

This difference in performance is expected, as the trilinear
method takes all three views together, rather than
every pair separately, thus avoiding line intersections.



5.2 Experiments On Real Images 

Figure 2 shows three views of the object we selected for 
the experiment. The object is a sports shoe with added 
texture to facilitate the correspondence process. This 
object was chosen because of its complexity, i.e., it has a 
shape of a natural object and cannot easily be described 
parametrically (as a collection of planes or algebraic sur- 
faces). Note that the situation depicted here is challeng- 
ing because the re-projected view is not in-between the 
two model views, i.e., one should expect a larger sensitivity
to image noise than in in-between situations. A set of
34 points was manually selected on one of the frames,
$\psi_1$, and their correspondences were automatically obtained
along all other frames used in this experiment.
The correspondence process is based on an implementa- 
tion of a coarse-to-fine optical-flow algorithm described 
in [7]. To achieve accurate correspondences across dis- 
tant views, intermediate in-between frames were taken 
and the displacements across consecutive frames were 
added. The overall displacement field was then used to 
push ("warp") the first frame towards the target frame 
and thus create a synthetic image. Optical-flow was ap- 
plied again between the synthetic frame and the target 
frame and the resulting displacement was added to the 
overall displacement obtained earlier. This process pro- 
vides a dense displacement field which is then sampled 
to obtain the correspondences of the 34 points initially 
chosen in the first frame. The results of this process are 
shown in Figure 2 by displaying squares centered around 
the computed locations of the corresponding points. One 
can see that the correspondences obtained in this manner 
are reasonable, and in most cases to sub-pixel accuracy. 
One can readily automate this process further by selecting
points in the first frame for which the Hessian matrix
of spatial derivatives is well conditioned (similar
to the confidence values suggested in the implementations
of [4, 7, 30]); however, the intention here was not
so much to build a complete system as to test the
performance of the trilinear re-projection method and
compare it to the performance of epipolar intersection
and the linear combination methods.

Figure 2: Top Row: Two model views, $\psi_1$ on the left and $\psi_2$ on the right. The overlaid squares illustrate the
corresponding points (34 points). Bottom Row: Third view $\psi_3$. Note that $\psi_3$ is not in-between $\psi_1$ and $\psi_2$, making
the re-projection problem more challenging (i.e., performance is more sensitive to image noise than in in-between
situations).

Figure 3: Re-projection onto $\psi_3$ using the trilinear result. The re-projected points are marked as crosses, and therefore
should be at the center of the squares for accurate re-projection. On the left, the minimal number of points was used
for recovering the trilinear coefficients (nine points); the average pixel error between the true and estimated locations
is 1.4, and the maximal error is 5.7. On the right, 12 points were used in a least-squares fit; average error is 0.4 and
maximal error is 1.4.

Figure 4: Results of re-projection using intersection of epipolar lines. The re-projected points are marked as crosses,
and therefore should be at the center of the squares for accurate re-projection. In the lefthand display the ground plane
points were used for recovering the essential matrix (see text), and in the righthand display the essential matrices
were recovered from the implementation of [19] using all 34 points across the three views. Maximum displacement
error in the lefthand display is 25.7 pixels and average error is 7.7 pixels. Maximal error in the righthand display is
43.4 pixels and average error is 9.58 pixels.

The trilinear method requires at least nine corresponding
points across the three views (we need 17 equations,
and nine points provide 18 equations), whereas epipolar
intersection can be done (in principle) with eight points.
The question we are about to address is what is the 
number of points that are required in practice (due to 
errors in correspondence, lens distortions and other ef- 
fects that are not adequately modeled by the pin-hole 
camera model) to achieve reasonable performance? 
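
To make the count of equations concrete, the sketch below builds the two rows that a single correspondence contributes to the homogeneous linear system for the trilinear coefficients. The ordering of the 18 unknowns and the function names are assumptions made for illustration; solving the stacked system (for instance by taking the singular vector of the smallest singular value) is not shown.

```python
def design_rows(p, p_prime, p_dprime):
    """Two rows of the homogeneous linear system for one correspondence.

    Assumed unknown ordering (18 entries, defined up to scale, hence 17 free):
      [alpha_1..alpha_6 (shared with beta_1..beta_6),
       alpha_7..alpha_12, beta_7..beta_12]
    p = (x, y), p_prime = (x', y'), p_dprime = (x'', y'').
    """
    x, y = p
    x1 = p_prime[0]            # only x' enters the pair of trilinear equations
    x2, y2 = p_dprime
    q = [x, y, 1.0]
    zeros = [0.0] * 6
    tail = [x1 * t for t in q] + q          # multiplies coefficients 7..12
    row_x = [x2 * t for t in q] + [x2 * x1 * t for t in q] + tail + zeros
    row_y = [y2 * t for t in q] + [y2 * x1 * t for t in q] + zeros + tail
    return row_x, row_y


def build_system(correspondences):
    """Stack the rows for all (p, p', p'') triplets; nine triplets give 18 rows."""
    rows = []
    for p, p_prime, p_dprime in correspondences:
        rows.extend(design_rows(p, p_prime, p_dprime))
    return rows
```

With more than nine correspondences the same stacked matrix is simply taller, and the coefficient vector is obtained in the least-squares sense.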

The trilinear result was first applied with the minimal 
number of points (nine) for solving for the coefficients, 
and then applied with 12 points using a linear least- 
squares solution. The results are shown in Figure 3. 
Nine points provide a re-projection with maximal error 
of 5.7 pixels and average error of 1.4 pixels. The solution 
using 12 points provided a significant improvement with 
maximal error of 1.4 pixels and average error of 0.4 pixels.
Using more points did not significantly improve the results;
for example, when all 34 points were used, the maximal
error went down to 1.14 pixels and the average error stayed
at 0.42 pixels.
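
The error figures quoted here are the average and maximal pixel distance between true and re-projected locations. A minimal sketch of such a measurement is given below, assuming Euclidean distance in the image plane (the text does not state the metric explicitly); the function name and the sample coordinates are hypothetical.

```python
import math

def reprojection_errors(true_pts, est_pts):
    """Return (average, maximal) Euclidean distance in pixels between
    corresponding true and estimated image locations."""
    dists = [math.hypot(tx - ex, ty - ey)
             for (tx, ty), (ex, ey) in zip(true_pts, est_pts)]
    return sum(dists) / len(dists), max(dists)

# Made-up coordinates, purely to exercise the function.
true_pts = [(0.0, 0.0), (10.0, 10.0)]
est_pts = [(3.0, 4.0), (10.0, 11.0)]
avg_err, max_err = reprojection_errors(true_pts, est_pts)
print(avg_err, max_err)
```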

Next the epipolar intersection method was applied. 
We used two methods for recovering the essential matrices.
One method uses the implementation of [19],
and the other takes advantage of the fact that four of the
corresponding points come from a plane (the ground
plane). In the former case, many more than eight points
were required in order to achieve reasonable results. For
example, when using all 34 points, the maximal error
was 43.4 pixels and the average error was 9.58 pixels.
In the latter case, we first recovered the homography B
due to the ground plane and then the epipole v" using
two additional points (those on the film cartridges). It
is then known (see [26, 20]) that F13 = [v"]B, where [v"]
is the anti-symmetric matrix of v". A similar procedure
was used to recover F23. Therefore, only six points were
used for re-projection, but nevertheless the results were
slightly better: maximal error of 25.7 pixels and average
error of 7.7 pixels. Figure 4 shows these results.
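
The construction F = [v"]B and the subsequent intersection of epipolar lines can be sketched as follows. This is an illustrative sketch, not the implementation used in the experiments: it assumes the convention that F13 maps a homogeneous point of view 1 to its epipolar line in view 3 (and likewise F23 for view 2), all function names are hypothetical, and the numbers at the end are synthetic.

```python
def skew(v):
    """Anti-symmetric matrix [v] with [v]u equal to the cross product v x u."""
    x, y, z = v
    return [[0.0, -z, y],
            [z, 0.0, -x],
            [-y, x, 0.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(m, p):
    return [sum(m[i][k] * p[k] for k in range(3)) for i in range(3)]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def fundamental_from_plane(epipole, homography):
    """F = [epipole] * homography, for a homography induced by a scene plane."""
    return matmul(skew(epipole), homography)

def reproject_by_intersection(F13, F23, p1, p2):
    """Intersect the two epipolar lines in the third view.
    Degenerate (w == 0) when the two lines are parallel or coincide."""
    l1 = matvec(F13, p1)   # epipolar line of p1 in view 3
    l2 = matvec(F23, p2)   # epipolar line of p2 in view 3
    x, y, w = cross(l1, l2)
    return (x / w, y / w)

# Synthetic check: identity homographies, epipoles at (0,0) and (10,0).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
F13 = fundamental_from_plane([0.0, 0.0, 1.0], identity)
F23 = fundamental_from_plane([10.0, 0.0, 1.0], identity)
point3 = reproject_by_intersection(F13, F23, [2.0, 3.0, 1.0], [7.0, 3.0, 1.0])
print(point3)
```

Because the re-projected point is the intersection of two lines, small errors in either line can move it considerably when the lines meet at a shallow angle, which is consistent with the large errors reported for this method.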

Finally, we tested the performance of re-projection
using the linear combination method. Since the linear
combination method holds only for orthographic views, we
are actually testing the orthographic assumption under 
a perspective situation, or in other words, whether the 
higher (bilinear and trilinear) order terms of the trilin- 
ear equations are significant or not. The linear combina- 
tion method requires at least four corresponding points 
across the three views. We applied the method with four, 
12 (for comparison with the trilinear case shown in Fig- 
ure 3), and all 34 points (the latter two using linear least 
squares). The results are displayed in Figure 5. The
performance in all cases is significantly poorer than when
using the trilinear functions, but better than with the
epipolar intersection method.
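
For orientation, the four-point linear combination fit can be sketched as below. The specific form x" = a1 x + a2 y + a3 x' + a4 (one such equation per coordinate of the novel view) is an assumption made for illustration; the text above states only that four corresponding points are required. All names are hypothetical and the views are synthetic parallel projections.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_linear_combination(view1, view2, novel_coord):
    """Fit a1..a4 from exactly four correspondences (assumed form)."""
    rows = [[x, y, xp, 1.0] for (x, y), (xp, _) in zip(view1, view2)]
    return solve_linear(rows, list(novel_coord))

def predict(coeffs, p1, p2):
    a1, a2, a3, a4 = coeffs
    return a1 * p1[0] + a2 * p1[1] + a3 * p2[0] + a4

# Synthetic parallel-projection views of five 3D points.
points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (2, 3, 4)]
view1 = [(X, Y) for X, Y, Z in points]
view2 = [(0.8 * X + 0.6 * Z, Y) for X, Y, Z in points]
novel_x = [0.6 * X + 0.8 * Z + 2.0 for X, Y, Z in points]

coeffs = fit_linear_combination(view1[:4], view2[:4], novel_x[:4])
predicted = predict(coeffs, view1[4], view2[4])
expected = novel_x[4]
print(predicted, expected)
```

The fit is exact here only because the synthetic views are truly parallel projections; with perspective views the residual reflects the neglected higher-order terms discussed above.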

6 Discussion 

We have seen that any view of a fixed 3D object can 
be expressed as a trilinear function with two reference 
views in the general case, or as a bilinear function when 
the reference views are created by means of parallel pro- 
jection. These functions provide alternative, much sim- 
pler, means for manipulating views of a scene than other 
methods. Experimental results show that the trilinear 
functions are also useful in practice yielding performance 
that is significantly better than epipolar intersection or 
the linear combination method. 

The application that was emphasized throughout the 
paper is visual recognition via alignment. Reasonable 




Figure 5: Results of re-projection using the linear combination of views method proposed by [34] (applicable to 
parallel projection). Top Row: In the lefthand display the linear coefficients were recovered from four corresponding 
points; maximal error is 56.7 pixels and average error is 20.3 pixels. In the righthand display the coefficients were 
recovered using 12 points in a linear least squares fashion; maximal error is 24.3 pixels and average error is 6.8 pixels. 
Bottom Row: The coefficients were recovered using all 34 points across the three views. Maximal error is 29.4 pixels 
and average error is 5.03 pixels. 



performance was obtained with 12 corresponding points
with the novel view (ψ), which may be too many if the
image-to-model matching is done by trying all possible
combinations of point matches. The existence of bilinear 
functions in the special case where the model is ortho- 
graphic, but the novel view is perspective, is more en- 
couraging from the standpoint of counting points. Here 
we have the result that only six corresponding points 
are required to obtain recognition of perspective views 
(provided we can satisfy the requirement that the model 
is orthographic). We have not experimented with bilin- 
ear functions to see how many points would be needed 
in practice, but plan to do that in the future. Because 
of their simplicity, one may speculate that these alge- 
braic functions will find uses in tasks other than visual 
recognition — some of those are discussed below. 

There may exist other applications where simplicity 
is of major importance, whereas the number of points 
is less of a concern. Consider for example, the appli- 
cation of model-based compression. With the trilinear 
functions we need 17 parameters to represent a view as 
a function of two reference views in full correspondence. 
Assume both the sender and the receiver have the two 
reference views and apply the same algorithm for obtain- 
ing correspondences between the two views. To send 
a third view (ignoring problems of self-occlusion that
could be dealt with separately), the sender can solve for the
17 parameters using many points, but eventually send 
only the 17 parameters. The receiver then simply com- 
bines the two reference views in a "trilinear way" given 
the received parameters. This is clearly a domain where 
the number of points is not a major concern, whereas
simplicity and robustness (as shown above), due to the
short-cut in the computations, are of great importance.
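
To give a sense of the payload involved, the 17 parameters could be serialized as in the sketch below. The wire format (little-endian 64-bit floats) and the function names are purely hypothetical; the text specifies only that 17 parameters are sent per view.

```python
import struct

NUM_PARAMS = 17          # number of parameters per view, from the text
WIRE_FORMAT = "<17d"     # hypothetical format: 17 little-endian doubles

def encode_view(params):
    """Sender side: pack the 17 fitted parameters into bytes."""
    if len(params) != NUM_PARAMS:
        raise ValueError("expected exactly 17 parameters")
    return struct.pack(WIRE_FORMAT, *params)

def decode_view(payload):
    """Receiver side: recover the 17 parameters from the payload."""
    return list(struct.unpack(WIRE_FORMAT, payload))

params = [float(i) / 7.0 for i in range(NUM_PARAMS)]  # placeholder values
payload = encode_view(params)
restored = decode_view(payload)
print(len(payload))
```

Under this assumed format each additional view costs 136 bytes, independent of image resolution and of how many points the sender used in the fit.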

Related to image coding, an approach of image decom- 
position into "layers" was recently proposed by [1, 2]. In 
this approach, a sequence of views is divided up into regions,
the motion of each of which is described approximately
by a 2D affine transformation. The sender sends the first
image followed only by the six affine parameters for each 
region for each subsequent frame. The use of algebraic 
functions of views can potentially make this approach 
more powerful because instead of dividing up the scene
into planes (they would be planes if the projection
were parallel; in general they are not even planes) one can
attempt to divide the scene into objects, each carrying the
17 parameters describing its displacement onto the
subsequent frame.
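
For comparison, the six-parameter affine motion used per region in the layered approach can be sketched as follows. The parameter ordering (a, b, c, d, tx, ty) and the function name are assumptions made for illustration and are not taken from [1, 2].

```python
def apply_affine(params, point):
    """Map an image point by a 2D affine transformation.
    Assumed ordering: x' = a*x + b*y + tx, y' = c*x + d*y + ty."""
    a, b, c, d, tx, ty = params
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)

# A pure translation and a 90-degree rotation, as simple examples.
translated = apply_affine((1, 0, 0, 1, 2, 3), (5, 5))
rotated = apply_affine((0, -1, 1, 0, 0, 0), (1, 0))
print(translated, rotated)
```

An object-based scheme would replace the six numbers per region with the 17 parameters per object discussed above, at the cost of requiring two reference views in correspondence.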

Another area of application may be in computer 
graphics. Re-projection techniques provide a short-cut 
for image rendering. Given two fully rendered views 
of some 3D object, other views (again ignoring self- 
occlusions) can be rendered by simply "combining" the 
reference views. Again, the number of corresponding 
points is less of a concern here. 